You can run through this scenario on Katacoda here.
This scenario is built using the following commands in the Terminal pane in RStudio.
# create a new scenario
? Friendly url: format-shape-data
? Scenario Title: Format and shape your data in R with the tidyverse
? Scenario Description: Quickly wrangle your data with the tidyr and dplyr packages from the tidyverse
? Difficulty Level: Intermediate
? Estimated time: (Please specify in minutes or hours e.g. 2 hours) 40
? Number of Steps (Not including intro & finish): 12
? Image: rlang:3.4
? Layout: 2x Terminal SplitThe following folder was created from these commands.
#> ..
#> ├── code
#> ├── data
#> ├── docs
#> ├── figs
#> ├── finish.md
#> ├── index.json
#> ├── intro.md
#> ├── step1.md
#> ├── step10.md
#> ├── step11.md
#> ├── step12.md
#> ├── step2.md
#> ├── step3.md
#> ├── step4.md
#> ├── step5.md
#> ├── step6.md
#> ├── step7.md
#> ├── step8.md
#> └── step9.md
We can see all steps, the intro, and finish markdown files. There is also an index.json file for managing all the steps in the Katacoda environment.
Katacoda comes with it’s own flavor of Markdown, which you can learn more about here.. The main thing to know is that there are execute and copy extensions, which you can use in the markdown to indicate whether Katacoda should copy or run the code inside the R REPL.
The learner personas for this scenario are:
Horatio is an executive at a tech company who wants to learn more about data manipulation with R and Python
Judy is a product manager at a start-up and needs to restructure data before loading it into their database.
Andrew is a graduate student in college and needs to analyze his thesis/dissertation project.
We rarely collect data in a way that can be immediately analyzed or visualized. “Wrangling,” “munging,” or “cleaning” are the steps to prepare data for a table, graph, algorithm, or model. These terms might sound like they’re referring to tedious, menial work, but data wrangling is such a common task that most data scientists consider it up to 80% of their job.
This scenario will introduce the following topics in R:
tidyverseIn this scenario, we’ll be using the tidyverse packages, “an opinionated collection of R packages designed for data science”
Launch an R console by clicking here -> R{{execute}}
You are looking at the R REPL (read-eval-print-loop) Katacoda has designed to run R in our browser.
R is an object-oriented, functional programming scripting language. R is free and open-source software (FOSS) with a massive global community of users and developers who have helped create and maintain tools for data manipulation, graphics, statistics, and machine learning.
Generally speaking, the R language has functions, commands, and operators.
Functions take inputs and return outputs:
function('input') {
perform command(s) on 'input'
return output
}
outputCommands are verbs we use to tell the R to do something. Unlike functions, commands don’t always return an output.
For example, install.packages() downloads and installs R packages into a local folder on our computer, and the library() command loads the packages.
install.packages("tidyverse")
library(tidyverse)Operators are symbols (or collections of symbols) for performing arithmetic (+, -, *, /), Boolean (logical) operations (TRUE or FALSE), comparisons (<, >, =<, =>), and assignment (<- and =).
R packages are collections of commands for a particular purpose or task. R comes ‘out of the box’ with a handful of useful commands.
For this scenario, we’ll be using the tidyr and dplyr packages for data manipulation. Both packages are part of the tidyverse, which is a suite of tools pioneered by RStudio’s Chief Scientist Hadley Wickham. All packages in the tidyverse work well together because they center around a common thread of tidy data.
When we load data into R, these data get stored in a data object. To do things to any data object (manipulate, analyze, visualize, model, etc.), we’ll need to use functions. We can write custom functions, or we can use one of the over 10,000 user-written packages available on CRAN.
Manipulating data in R requires sending commands to the Terminal. Sometimes these are commands we type ourselves, other times we will copy + paste code we can adapt from another source (like Stack Overflow). THIS IS A NORMAL PART OF CODING. Feel free to adapt and experiment with code you find (provided it’s not someone else’s private work).
Install the tidyverse package by clicking on the ‘copy’ icon below and pasting the code into the Terminal window, then hitting enter or return.
Ctrl+V
or
Cmd+V
# click to copy code
install.packages("tidyverse")Load the package with the base::library() function.
# click to copy code
library(tidyverse)The code in this scenario follows the tidyverse style guide as closely as possible.
“Each line of a comment should begin with the comment symbol and a single space:
#”
# comments aren't run in the terminal“In data analysis code, use comments to record important findings and analysis decisions. If you need comments to explain what your code is doing, consider rewriting your code to be clearer.”
If we want to use a function from a package, the syntax for doing this is package::function()
For example, below, we’ll use the tidyverse_logo() function from the tidyverse package to view an awesome logo.
# click to execute code
tidyverse::tidyverse_logo()#> ⬢ __ _ __ . ⬡ ⬢ .
#> / /_(_)__/ /_ ___ _____ _______ ___
#> / __/ / _ / // / |/ / -_) __(_-</ -_)
#> \__/_/\_,_/\_, /|___/\__/_/ /___/\__/
#> ⬢ . /___/ ⬡ . ⬢
The tidyverse::tidyverse_logo() function can run without any arguments (i.e. nothing inside the parentheses), but we can view the arguments by placing the cursor inside the parenthesis and hitting the tab key.
We can enter function arguments by position or name (see below).
It’s hard to learn any of R’s capabilities without a dataset. There are multiple ways to get datasets into the R environment, but we’ll cover that in another scenario.
Today we’ll be using a dataset from the fivethirtyeight package in R. This package has over 100 datasets from articles on the website FiveThirtyEight.
The dataset we will be using today comes from the article “A Statistical Analysis of the Work of Bob Ross.”
Bob Ross was the host of The Joy of Painting, a painting educational program from 1980 - 1994.
We’ll load the BobRoss dataset into R using fivethirtyeight::bob_ross.
First, we have to install and load the package into the R environment.
# click to execute code
install.packages("fivethirtyeight")
library(fivethirtyeight)Now we can assign the bob_ross data frame into the BobRoss object.
# click to execute code
BobRoss <- fivethirtyeight::bob_rossAs we noted earlier, functions are like verbs in the R language, and the data we’ve created is the object. Similar to English grammar, the verbs (functions) do things to the objects. We can build data objects in R using a variety of methods, but typically we’ll be loading data from an outside source into the R environment.
# click to execute code
# print data
BobRossWe should pay attention to the information printed in BobRoss. As we can see, it’s contained in a tibble.
# A tibble: 403 x 71
episode season episode_num title apple_frame aurora_borealis barn beach
<chr> <dbl> <dbl> <chr> <int> <int> <int> <int>
1 S01E01 1 1 A WA… 0 0 0 0
2 S01E02 1 2 MT. … 0 0 0 0
3 S01E03 1 3 EBON… 0 0 0 0
4 S01E04 1 4 WINT… 0 0 0 0
5 S01E05 1 5 QUIE… 0 0 0 0
6 S01E06 1 6 WINT… 0 0 0 0
7 S01E07 1 7 AUTU… 0 0 0 0
8 S01E08 1 8 PEAC… 0 0 0 0
9 S01E09 1 9 SEAS… 0 0 0 1
10 S01E10 1 10 MOUN… 0 0 0 0
# … with 393 more rows, and 63 more variables: boat <int>, bridge <int>,
# building <int>, bushes <int>, cabin <int>, cactus <int>,
# circle_frame <int>, cirrus <int>, cliff <int>, clouds <int>, conifer <int>,
# cumulus <int>, deciduous <int>, diane_andre <int>, dock <int>,
# double_oval_frame <int>, farm <int>, fence <int>, fire <int>,
# florida_frame <int>, flowers <int>, fog <int>, framed <int>, grass <int>,
# guest <int>, half_circle_frame <int>, half_oval_frame <int>, hills <int>,
# lake <int>, lakes <int>, lighthouse <int>, mill <int>, moon <int>,
# mountain <int>, mountains <int>, night <int>, ocean <int>,
# oval_frame <int>, palm_trees <int>, path <int>, person <int>,
# portrait <int>, rectangle_3d_frame <int>, rectangular_frame <int>,
# river <int>, rocks <int>, seashell_frame <int>, snow <int>,
# snowy_mountain <int>, split_frame <int>, steve_ross <int>, structure <int>,
# sun <int>, tomb_frame <int>, tree <int>, trees <int>, triple_frame <int>,
# waterfall <int>, waves <int>, windmill <int>, window_frame <int>,
# winter <int>, wood_framed <int>
tibbles print the dimensions of the dataset (# A tibble: 403 x 71), numerical indices for rows (far left), and the type of variable in the dataset (<chr> variables have text information vs. <dbl> and <int> variables which contain numbers), and additional information about the data when it’s too big to print on the screen (… with 393 more rows, and 63 more variables:). Read more about tibbles in this chapter of R for Data Science.
One quick and easy way to get data into R is to create the data ‘by hand’ using the tibble::tribble() function. Read more about this function here in R For Data Science.
These data are stored as a comma-separated values file on Github (see below).
title, bushes, clouds,
"A WALK IN THE WOODS", 1L, 0L,
"MT. MCKINLEY", 0L, 1L,
"EBONY SUNSET", 0L, 0L,
"WINTER MIST", 1L, 1L,
"QUIET STREAM", 0L, 0L,
"WINTER MOON", 0L, 0L,
"AUTUMN MOUNTAINS", 0L, 0L,
"PEACEFUL VALLEY", 1L, 0L,
"SEASCAPE", 0L, 1L,
"MOUNTAIN LAKE", 1L, 0L,
"WINTER GLOW", 0L, 0L,
"SNOWFALL", 0L, 1L,
"FINAL REFLECTIONS", 1L, 0L,
"MEADOW LAKE", 1L, 0L,
"WINTER SUN", 0L, 0L
A description of these variables is below:
title: Title of episode
bushes: Present (1) or not (0)
clouds: Present (1) or not (0)
Both data frames (called data.frames in R) and tibbles are rectangular data objects. They have columns and rows for storing data, similar to a spreadsheet in Excel. The main difference between a tibble and a data.frame is how they are printed to the console. We are going to create a tibble and data.frame with the data above to demonstrate these differences.
tibbleThe tibble::tribble() function takes column names preceded by the tilde (i.e. ~column), then get separated by commas (~column1, ~column2, ~column3). Then tibble::tribble() reads the data starting on the second line, similar to the way the actual data are stored above.
The tibble we create below is a small sample from the bob_ross dataset. The code below creates BobRossTibble from the data above in the R environment.
# click to execute code
BobRossTibble <- tibble::tribble(
~title, ~bushes, ~clouds,
"A WALK IN THE WOODS", 1L, 0L,
"MT. MCKINLEY", 0L, 1L,
"EBONY SUNSET", 0L, 0L,
"WINTER MIST", 1L, 1L,
"QUIET STREAM", 0L, 0L,
"WINTER MOON", 0L, 0L,
"AUTUMN MOUNTAINS", 0L, 0L,
"PEACEFUL VALLEY", 1L, 0L,
"SEASCAPE", 0L, 1L,
"MOUNTAIN LAKE", 1L, 0L,
"WINTER GLOW", 0L, 0L,
"SNOWFALL", 0L, 1L,
"FINAL REFLECTIONS", 1L, 0L,
"MEADOW LAKE", 1L, 0L,
"WINTER SUN", 0L, 0L)
BobRossTibbledata.frameThe base::data.frame() function creates columns transposed, with values displayed left to right). We can create data.frames with a series of named vectors, or supply them directly to the function (both work).
Below we create three vectors (title, bushes, and clouds), then supply these to the base::data.frame() function and assign it to BobRossDataFrame. The names of the vectors end up being the column names.
# click to execute code
title <- c("A WALK IN THE WOODS", "MT. MCKINLEY", "EBONY SUNSET",
"WINTER MIST", "QUIET STREAM", "WINTER MOON", "AUTUMN MOUNTAINS",
"PEACEFUL VALLEY", "SEASCAPE", "MOUNTAIN LAKE", "WINTER GLOW",
"SNOWFALL", "FINAL REFLECTIONS", "MEADOW LAKE", "WINTER SUN")
bushes <- c(1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L)
clouds <- c(0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L)
BobRossDataFrame <- data.frame(title, bushes, clouds)
BobRossDataFrameThis is the same data.frame, but with the vectors supplied as arguments inside the data.frame() function.
# click to execute code
BobRossDataFrame <- data.frame(
title = c("A WALK IN THE WOODS", "MT. MCKINLEY", "EBONY SUNSET",
"WINTER MIST", "QUIET STREAM", "WINTER MOON", "AUTUMN MOUNTAINS",
"PEACEFUL VALLEY", "SEASCAPE", "MOUNTAIN LAKE", "WINTER GLOW",
"SNOWFALL", "FINAL REFLECTIONS", "MEADOW LAKE", "WINTER SUN"),
bushes = c(1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L),
clouds = c(0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L))
BobRossDataFrameThere are a few things to pay attention to in the code output above:
<- is used to create the SmallBobRoss tibble, which is a data objecttibbles print out the first ten rows, and the following header and footer:# header
# A tibble: 15 x 3
title bushes clouds
<chr> <int> <int>
# footer
# … with 5 more rowsThis header tells us what kind of variables are in the tibble (<chr> vs. <int>)
data.frames print the entire rectangle to the console title bushes clouds
1 A WALK IN THE WOODS 1 0
2 MT. MCKINLEY 0 1
3 EBONY SUNSET 0 0
4 WINTER MIST 1 1
5 QUIET STREAM 0 0
6 WINTER MOON 0 0
7 AUTUMN MOUNTAINS 0 0
8 PEACEFUL VALLEY 1 0
9 SEASCAPE 0 1
10 MOUNTAIN LAKE 1 0
11 WINTER GLOW 0 0
12 SNOWFALL 0 1
13 FINAL REFLECTIONS 1 0
14 MEADOW LAKE 1 0
15 WINTER SUN 0 0We see more data, but know less about the format of the variables.
The tidyverse readr package has functions for loading multiple rectangular file types, including comma-separated value, tab-separated value, and other fixed width format files.
The code below allows us to import data directly from a web URL. The bitly link takes us to a comma-separated values (.csv) file with the same data we loaded in the previous step.
# click to execute code
SmallBobRoss <- readr::read_csv("https://bit.ly/small-bob-ross")
SmallBobRossThe code above loads the data from an external source (see the data here) into a data object SmallBobRoss, then prints this object to the screen.
After these data are loaded into R, we get a message about how the data were formatted,
Parsed with column specification:
cols(
title = col_character(),
bushes = col_double(),
clouds = col_double()
)col_character() means these data are text, which makes sense because they are the titles for the episodes. The col_double() tells us the bushes and clouds variables were imported as double (a kind of numerical variable in R).
We can view the BobRoss dataset using dplyr’s glimpse() function, which shows the data in a transposed view (glimpse() displays the variables horizontally, and prints as much data as possible to the screen.
# click to execute code
glimpse(SmallBobRoss)#> Rows: 5
#> Columns: 3
#> $ title <chr> "A WALK IN THE WOODS", "MT. MCKINLEY", "EBONY SUNSET", "WINT…
#> $ bushes <dbl> 1, 0, 0, 1, 0
#> $ clouds <dbl> 0, 1, 0, 1, 0
We like to think of wrangling on two different levels. The first level deals with the data shape and structure. The second level of data wrangling refers to the format of individual variables (which we will get to in the following steps).
Questions we should be asking ourselves about data at this level include:
A common task for data manipulation is moving columns to rows, or rows to columns. The tidyr package in R makes this easy with two tidyr::pivot_ functions.
dplyr::glimpse() is a convenient function for examining the structure and shape of a dataset. Other options include utils::str() and utils::head().
# click to execute code
head(BobRoss)We also want to know if there are duplicate rows in the BobRoss data, and we can check this by using dplyr::distinct() and base::nrow() with base::identical()
Number of rows?
# click to execute code
base::nrow(BobRoss)#> [1] 403
distinct rows (note this returns a tibble!)
# click to execute code
dplyr::distinct(.data = BobRoss)how about we check to see if they’re identical?
# click to execute code
identical(x = nrow(BobRoss), y = nrow(dplyr::distinct(BobRoss)))#> [1] TRUE
If you see TRUE, this means all the rows are unique!
Now we can answer more of the questions above:
As we can see in the Terminal, the BobRoss dataset is full of the variables named for the various objects in his paintings. From apple_frame down to wood_framed, the objects have what appears to be 0s and 1s for values. These are called indicators, or binary variables.
The columns in BobRoss represent various objects in Bob Ross’s paintings, and the values in the rows are whether or not the object was present or absent in a particular episode. An image of this data arrangement is below:
Can you think of another way to organize the same information? What if we changed the original data from having a column for each different painting object to a format with only two columns: the name of the painting object, and whether or not it was present. Take a look at the data format below:
object column keeps track of the thing in the painting (apple_frame, aurora_borealis, barn, etc.),present column corresponds to the number of times that particular thing occurred,season, episode, episode_num, and title information repeats down the rows for every object that was initially in a columnData arrangements like this are typically called long or tidy, but the vital thing to know is that each value has an index in two places (in this case object and present). We can convert BobRoss into a long dataset using tidyr::pivot_longer().
The tidyr::pivot_longer() function takes the following arguments:
BobRoss),c(apple_frame:wood_framed) is shorthand for apple_frame through wood_framed),names_to = 'object'),values_to = 'present')Click anywhere in the dark area below to run the code and see the result.
# click to execute code
BobRossLong <- pivot_longer(data = BobRoss,
cols = c(apple_frame:wood_framed),
names_to = 'object',
values_to = 'present')
# view data
head(BobRossLong)Now that we have two datasets in the R environment, we can compare their structures with dim() (short for data dimensions).
# click to execute code
dim(BobRoss)#> [1] 403 71
This is telling us there are 403 rows and 71 columns in BobRoss…
# click to execute code
dim(BobRossLong)#> [1] 27001 6
…vs. 27001 rows and 6 columns in BobRossLong!
As we can see, the BobRossLong has a ton more rows, but far fewer columns. The dimensions of the dataset have changed, but we’ve retained the information.
But what if we want to keep the dataset in it’s original ‘wide’ format? The tidyr::pivot_wider() is the complement to tidyr::pivot_longer(), and it takes the following arguments:
BobRossLong),names_from = object),values_from = present)# click to execute code
BobRossWide <- pivot_wider(data = BobRossLong,
names_from = object,
values_from = present)
head(BobRossWide)Does the BobRossWide dataset have the same information as the original BobRoss dataset? We can check with dplyr::setdiff() which will test for differences in the two tibbles:
# click to execute code
dplyr::setdiff(x = BobRoss, y = BobRossWide)Notice this returned an empty tibble? That means BobRoss and BobRossWide are identical (i.e. no set differences).
We’ve just shown two formats with the same information in them. You might be wondering which is better, and the answer is it depends. R prefers datasets formatted as long, but there are other reasons you might want to store (or collect) data in the wide format.
Fortunately, now you don’t have to choose because you can easily change whatever arrangement the data are in!
Sometimes data are not as granular as we would like them to be, so we need to break up bits of information into individual columns.
For example, the episode column contains information on the season and number from season and episode_num. But what if these two columns were missing? We could recreate them using tidyr::separate().
We will load an alternative version of BobRossLong into the R environment for this step.
Please click on anywhere in the dark area below to run the code below and load BobRossStep7 into R.
# click to execute code
BobRossStep7 <- readr::read_csv(file = "https://bit.ly/bob-ross-step7")
glimpse(BobRossStep7)#> Rows: 27,001
#> Columns: 4
#> $ episode_info <chr> "S01, E01", "S01, E01", "S01, E01", "S01, E01", "S01, …
#> $ title <chr> "A WALK IN THE WOODS", "A WALK IN THE WOODS", "A WALK …
#> $ object <chr> "apple_frame", "aurora_borealis", "barn", "beach", "bo…
#> $ present <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
As we can see, BobRossStep7 has an episode_info variable with both season and episode information separated by a comma and space ("S01, E01" and "S01, E01",, etc.).
We want to use tidyr::separate() to split the episode_info column into season and episode using the following arguments.
data = the name of the dataset (BobRossStep7)col = the name of the column to separate (episode_info),into = the new columns for the separated information (combine these with c("season", "episode"))Copy the code below and replace c("______", "_______") with the two new column names before pasting it back into the Terminal and running it.
# click to copy code
separate(data = BobRossStep7, col = episode_info, into = c("______", "_______"))separate(data = BobRossStep7, col = episode_info, into = c("season", "episode"))If you’ve completed this step correctly, you should see the following dataset:
# A tibble: 27,001 x 5
season episode title object present
<chr> <chr> <chr> <chr> <dbl>
1 S01 E01 A WALK IN THE WOODS apple_frame 0
2 S01 E01 A WALK IN THE WOODS aurora_borealis 0
3 S01 E01 A WALK IN THE WOODS barn 0
4 S01 E01 A WALK IN THE WOODS beach 0
5 S01 E01 A WALK IN THE WOODS boat 0
6 S01 E01 A WALK IN THE WOODS bridge 0
7 S01 E01 A WALK IN THE WOODS building 0
8 S01 E01 A WALK IN THE WOODS bushes 1
9 S01 E01 A WALK IN THE WOODS cabin 0
10 S01 E01 A WALK IN THE WOODS cactus 0
# … with 26,991 more rowsAs we can see, season and episode contain the same information as the previous episode_info column.
Just like with the two pivot_ functions, there is an opposite version of tidyr::separate() called tidyr::unite(). The unite() function takes multiple columns and sticks them together into a single new column.
We’ve accumulated quite a few datasets in our R environment, which can make things seem cluttered. You can check the objects in R using ls().
# click to execute code
ls()#> [1] "BobRoss" "BobRossDataFrame" "BobRossLong" "BobRossStep7"
#> [5] "BobRossTibble" "BobRossWide" "bushes" "clouds"
#> [9] "SmallBobRoss" "title"
Yikes–so many Bob Ross’s! We will remove the SmallBobRoss, BobRoss and BobRossStep7 datasets using rm().
# click to execute code
rm(SmallBobRoss, BobRoss, BobRossStep7)Now check again with ls() just to make sure (you can remove any additional datasets with rm()).
unite()We’ll load another alternative version of BobRossLong into R and view it with head()
# click to execute code
BobRossStep8 <- readr::read_csv(file = "https://bit.ly/bob-ross-step8")
head(BobRossStep8)Now use tidyr::unite() with the following arguments:
data = BobRossStep8
the columns we want to unite (season_text and episode_text)
the new column name (col = episode_new)
And sep =, a regular expression pattern to place between the two columns we’re uniting (in this case, it’s E).
Replace col = '___________' below with the new united column name before entering and running it in the Terminal.
# click to copy code
unite(data = BobRossStep8, season_text, episode_text, col = '___________', sep = "E")unite(data = BobRossStep8, season_text, episode_text, col = 'episode_new', sep = "E")If you’ve filled in the function correctly, you should see the following data set:
# A tibble: 27,001 x 4
episode_new title object present
<chr> <chr> <chr> <dbl>
1 S01E01 A WALK IN THE WOODS apple_frame 0
2 S01E01 A WALK IN THE WOODS aurora_borealis 0
3 S01E01 A WALK IN THE WOODS barn 0
4 S01E01 A WALK IN THE WOODS beach 0
5 S01E01 A WALK IN THE WOODS boat 0
6 S01E01 A WALK IN THE WOODS bridge 0
7 S01E01 A WALK IN THE WOODS building 0
8 S01E01 A WALK IN THE WOODS bushes 1
9 S01E01 A WALK IN THE WOODS cabin 0
10 S01E01 A WALK IN THE WOODS cactus 0
# … with 26,991 more rowsAs we can see, unite() placed the E between season_text and episode_text.
Now we should feel more comfortable answering the last two questions related to this level of data wrangling:
Does each measurement have a column (or variable)? Yes, and if not I can pivot the data into another arrangement
Are these variables each measuring exactly one thing? Yes, and if not I can separate/unite columns to capture that one thing
We said we like to think of the first level of data wrangling as changes to the data structure itself (what the dimensions, columns, and rows should be). The second level of data wrangling refers to creating or calculating new variables based on existing columns and values.
You might be wondering how these two are different, and the primary difference is that all the changes we made used only the position or location of the data. For example, consider the data arrangement below:
This data has three months spread across columns and a category variable that serves as an index for the values. If we put these data in the long format, it will look like the image below:
The tidyr::pivot_longer() changes the position of the indices and values, but doesn’t calculate or create any new information. We can just as quickly move the data back into its original arrangement (see below).
unite and separate() also change the position and contents of the information, but they do not calculate or create new values.
dplyr packageThe primary package for data manipulation at the second level is dplyr, and we will explore its functions in the next three steps.
How dplyr::mutate() works:
dplyr::mutate(.data = DataFrame,
# changed variable
`new variable name` =
# function used to change variable
some_function(
# current variable name
`old variable name`))As you can see from the code and comments above, first, we enter the data set (DataFrame). Next, we introduce a name for the new variable we want to create (new variable name), the equals sign =, then the function we want to apply some_function() and the original variable we want to apply it to old variable name.
Let’s import a new version of BobRoss for this step.
BobRossStep9 <- readr::read_csv(file = "https://bit.ly/bob-ross-step9")We’re going to use the dplyr::mutate() function to change the format of a character variable (<chr>) to numeric (<dbl>). So if we run the code below, it should create episode_num, which is a numerical version episode_txt (converting <chr> to <dbl> to numeric).
# click to execute code
dplyr::mutate(.data = BobRossStep9, episode_num = as.numeric(episode_txt))We can see this creates a new variable, but we still have episode_txt. Next, we’ll convert the title to title case using the stringr package (also from the tidyverse), and we’ll overwrite the existing variable by naming the new variable title as well.
dplyr::mutate(.data = BobRossStep9,
# first new variable
episode_num = base::as.numeric(episode_txt),
# second new variable
title = stringr::str_to_title(title))The great thing about dplyr::mutate() and other tidyverse functions is that we can view our work before assigning anything to a dataset. It’s always a good idea to check your data wrangling before assigning it back to the data frame.
We will make one final change to the object variable by removing the underscores between the objects with stringr::str_replace_all().
dplyr::mutate(.data = BobRossStep9,
# first new variable
episode_num = base::as.numeric(episode_txt),
# second variable change
title = stringr::str_to_title(title),
# third variable change
object = stringr::str_replace_all(string = object,
pattern = "_",
replacement = " "))Now that we can see the variables are formatted how we want, we can assign the changes to the BobRossStep9 object.
BobRossStep9 <- dplyr::mutate(.data = BobRossStep9,
# first new variable
episode_num = base::as.numeric(episode_txt),
# second variable change
title = stringr::str_to_title(title),
# third variable change
object = stringr::str_replace_all(string = object,
pattern = "_",
replacement = " "))
head(BobRossStep9)Read more about stringr package.
Many times we’ll want to create a new variable based on the values in another variable. For example, assume we want to identify only those objects that are in season 1. We can do this by creating a new variable called season01, and assigning the value of TRUE to all the objects that occurred in season 1, and FALSE to every other object.
Let’s import a dataset for this step.
# click to execute code
BobRossStep10 <- readr::read_csv(file = "https://bit.ly/bob-ross-step10")
head(BobRossStep10)Create season01 by filling in the the appropriate logical values (TRUE/FALSE) for true = and false =.
# click to copy code
mutate(.data = BobRossStep10, season01 = if_else(season == 1, true = ____, false = _____))mutate(.data = BobRossStep10, season01 = if_else(season == 1, true = TRUE, false = FALSE))We notice this creates a <lgl> variable, which can only have one of two values (TRUE and FALSE). Logical is excellent for binary variables because you can still perform mathematical operations on them.
Logical vectors are coerced to integer vectors in contexts where a numerical value is required, with TRUE mapped to 1, FALSE to 0
Assuming we know how many episodes are season 1, we can check the new variable using dplyr::count().
# click to execute code
BobRossStep10 <- dplyr::mutate(.data = BobRossStep10,
season01 = if_else(condition = season == 1,
true = TRUE,
false = FALSE))
# count the new values
dplyr::count(BobRossStep10, season01)count returns a tibble with a sum of each value in season01.
We can also use if_else in combination with stringr::str_detect() to find all E01 instances in episode. The str_detect() is incredibly handy for finding patterns in text data because it returns a logical (TRUE or FALSE).
# click to execute code
BobRossStep10 <- dplyr::mutate(.data = BobRossStep10,
episode01 = if_else(condition = str_detect(
string = episode,
pattern = "E01"),
true = TRUE,
false = FALSE))
# count the new values
dplyr::count(BobRossStep10, episode01)We can see the total number of first episodes. Again, this assumes we know the total number of first episodes and that they are all coded correctly.
The dplyr::if_else() function works well with dplyr::mutate() for creating new variables based on a single condition, but sometimes we want to create new variables based on the values in more than one column (i.e., multiple conditions).
In this case, we can combine dplyr::mutate() with dplyr::case_when() to create new variables based on multiple conditions. These functions work together with the following arguments.
dplyr::case_when() with dplyr::mutate()# first we assign a new variable name
dplyr::mutate(.data = DataSet,
`new variable name`,
case_when(
# then we enter our conditions (condition 1)
left hand side condition 1 ~ right hand side result 1,
# condition 2
left hand side condition 2 ~ right hand side result 2))We can learn more about how dplyr::case_when works by accessing the R help files (?dplyr::case_when).
The left-hand side determines which values match this case. The right-hand side provides the replacement value. The left-hand side must evaluate to a logical vector. The right-hand side does not need to be logical, but all right-hand sides must evaluate to the same type of vector.
We will load a small example of BobRoss to experiment with dplyr::case_when(). This dataset only has the first three episodes of season 1.
# click to execute code
BobRossStep11 <- readr::read_csv(file = "https://bit.ly/bob-ross-step11")
head(BobRossStep11)We can see this is a reduced dataset from BobRoss. We will use dplyr::case_when() to create a object_category variable based on what objects were in a particular episode’s painting.
We’ll be using stringr::str_detect() again to find all the paintings of mountains, trees, and bushes, but we’ve dropped the names of arguments, so it’s easier to read.
# click to execute code
dplyr::mutate(.data = BobRossStep11,
object_category = case_when(
season == 1 & str_detect(object, "mountain") ~ "mountains",
season == 1 & str_detect(object, "deciduous") ~ "trees",
season == 1 & str_detect(object, "tree") ~ "trees",
season == 1 & str_detect(object, "conifer") ~ "trees",
season == 1 & str_detect(object, "bush") ~ "bushes"))The great thing about case_when() is that we can keep adding more conditions. For example, we can add water and buildings to the same object_category variable.
# click to execute code
dplyr::mutate(.data = BobRossStep11,
object_category = case_when(
season == 1 & str_detect(object, "mountain") ~ "mountains",
season == 1 & str_detect(object, "deciduous") ~ "trees",
season == 1 & str_detect(object, "tree") ~ "trees",
season == 1 & str_detect(object, "conifer") ~ "trees",
season == 1 & str_detect(object, "bush") ~ "bushes",
season == 1 & str_detect(object, "river") ~ "water",
season == 1 & str_detect(object, "barn") ~ "buildings",
season == 1 & str_detect(object, "cabin") ~ "buildings"))We can also include a ‘catch-all’ with a logical TRUE condition.
# click to execute code
BobRossStep11 <- dplyr::mutate(.data = BobRossStep11,
object_category = case_when(
season == 1 & str_detect(object, "mountain") ~ "mountains",
season == 1 & str_detect(object, "deciduous") ~ "trees",
season == 1 & str_detect(object, "tree") ~ "trees",
season == 1 & str_detect(object, "conifer") ~ "trees",
season == 1 & str_detect(object, "bush") ~ "bushes",
season == 1 & str_detect(object, "river") ~ "water",
season == 1 & str_detect(object, "barn") ~ "buildings",
season == 1 & str_detect(object, "cabin") ~ "buildings",
TRUE ~ "other"))
head(BobRossStep11)case_when and pivotAfter creating new variables, it’s always a good idea to check your work. Now that we have some tools for creating new variables, we can use the tidyr::pivot_ functions to check and make sure all the rows were assigned correctly.
We’ll import a small version of the bob_ross dataset to demonstrate how we can combine both levels of data wrangling to create new variables and check our work.
BobRossStep12 <- readr::read_csv(file = "https://bit.ly/bob-ross-step12")
head(BobRossStep12)Use utils::head(), dplyr::glimpse(), or utils::str() to view the contents of BobRossStep12.
First create a long version of BobRossStep12 using tidyr::pivot_longer(). We’ve provided some code to get you started,
the cols argument should be -c(episode, season, episode_num, title) (this means negate these four columns)
assign names_to to "object" and values_to to "present").
when you’re confident you have it correct, assign these changes to BobRossStep12.
BobRossStep12 <- tidyr::pivot_longer(data = BobRossStep12, cols = -c(episode, season, episode_num, title), names_to = "_______", values_to = "________")BobRossStep12 <- tidyr::pivot_longer(data = BobRossStep12, cols = -c(episode, season, episode_num, title), names_to = "object", values_to = "present")Now that we have a long dataset, we can create a new variable (painting_cats) for the following categories.
Label any mention of cabin as Cabins (there should only be one)
Capture all framed paintings under the label Frames (a single pattern of frame should do the trick)
Create a category for all the Trees (there are different types of trees here, so we can combine multiple arguments by placing a pipe (|) between them)
Label all the clouds as Clouds (same as above, but with different types of clouds)
All other missing values should be labeled NA with NA_character_ (here we will use TRUE as the condition, and the NA type as a result )
dplyr::mutate(.data = BobRossStep12,
painting_cats = case_when(
present == 1 & str_detect(object, "______") ~ "Cabins",
present == 1 & str_detect(object, "______") ~ "Frames",
present == 1 & str_detect(object, "____|_________") ~ "Trees",
present == 1 & str_detect(object, "_____|_______") ~ "Clouds",
TRUE ~ __________))
```{{copy}}
**Did you get it?**
You can run the code below to checkBobRossStep12 <- dplyr::mutate(.data = BobRossStep12,
painting_cats = case_when(
present == 1 & str_detect(object, "cabin") ~ "Cabins",
present == 1 & str_detect(object, "frame") ~ "Frames",
present == 1 & str_detect(object, "tree|deciduous") ~ "Trees",
present == 1 & str_detect(object, "cloud|cumulus") ~ "Clouds",
TRUE ~ NA_character_))
BobRossStep12Now we want to check our work by creating a cross-tabulation between object and painting_cats. We can do this with dplyr::count(), which tallies up all the values in a column. Copy the code below to see the new values in painting_cats.
# click to copy code
# the n column is the tally of the values
dplyr::count(BobRossStep12, painting_cats)If we want to count two variables, we simply separate them with a comma (see below).
# click to copy code
# the n is the tally, but for both variables
dplyr::count(BobRossStep12, object, painting_cats)As you can see, this is returning a tibble. We know how to reshape tibbles with the pivot_ functions now, so let’s restructure the output to view the values of painting_cats across columns.
# click to copy code
# create counts dataset of object and painting_cats
BobRossCounts <- dplyr::count(BobRossStep12, painting_cats, ______)BobRossCounts <- dplyr::count(BobRossStep12, painting_cats, object)Assign the names from our new painting_cats variable and values from n
# click to copy code
# reshape this to wide and use n as the values
pivot_wider(data = BobRossCounts, names_from = painting_cats, values_from = _)pivot_wider(data = BobRossCounts, names_from = painting_cats, values_from = n)We’ve gone over how to:
Load packages and data into R (utils::install.packages(), base::library(), tibble::tribble(), and readr::read_csv()).
how to use R to format and shape (or reformat and reshape) your data (tidyr::pivot_longer(), tidyr::pivot_wider(), tidyr::separate(), and tidyr::unite()).
how to create new variables based on existing columns dplyr::mutate() and dplyr::if_else()
how to create new variables based on multiple conditions (i.e. dplyr::mutate() and dplyr::case_when())
how to check your work with counting and pivoting (dplyr::count() and tidyr::pivot_wider())
The tutorial for building the scenario is here.
I took notes on this tutorial and made them available in this Google document.
There are multiple guidelines and resources for writing scenarios. O’Reilly has provided an Authoring Guide and Formatting and Design Guide.
The link for these files can be found here:
Scenario Readiness Checklist:
When you think your scenario is ready for publication, we recommend you run through the checklist below to ensure it is ready to go. [Reminder: We discuss these best practices in the Katacoda Formatting and Design Guide for Authors]:
Does the scenario start consistently and in a timely manner? More than 5 minutes to start would be cause to reconsider your build decisions.
Are your intro and final pages present and are their respective goals and lessons learned in agreement?
Have you tested your scenario lately?
Do you agree with the leveling (beginner, intermediate, advanced) you indicated when you started the build?
Was learning time you entered correct, or should it be adjusted?
Are your versions of tools and other dependencies up to date?
Have you tried every instruction?
Have you written each step in the most concise manner possible?
Have you run your text through a spelling/grammar checker?
Are your credits to others given present and correct?
Are your images legal and with credits?
Are your hyperlinks all working?
Do the goals and lessons learned items match the steps in the scenario?